Syllables and other String Kernel Extensions

Authors

  • Craig Saunders
  • Hauke Tschach
  • John Shawe-Taylor
Abstract

Recently, the use of string kernels that compare documents as a string of letters has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents and, as a result, reduces computation time. Moreover, syllables provide a more natural representation of text than either the traditional coarse representation given by the bag-of-words or the overly fine one that results from considering individual letters only. We give some experimental results which show that syllables can be used effectively in text-categorisation problems. In this paper we also propose two extensions to the string kernel. The first introduces a new lambda-weighting scheme, where different symbols can be given differing decay weightings. This may be useful in text and other applications where the insertion of certain symbols is known to be less significant. We also introduce the concept of 'soft matching', where symbols can match (possibly weighted by relevance) even if they are not identical. Again, this provides a method of incorporating prior knowledge, where certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items.
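The two extensions described in the abstract can be illustrated with a small brute-force sketch. This is not the authors' formulation or algorithm: it assumes documents are already split into syllable tokens (no syllabifier is included), it assumes the weight of an occurrence is the product of per-symbol decay factors over the span it covers, and it enumerates index tuples directly instead of using an efficient recursion. The names `soft_subsequence_kernel`, `span_weight`, `decay`, `default_decay` and `match` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import prod


def span_weight(seq, idx, decay, default_decay):
    """Product of per-symbol decay factors over the span covered by idx.

    Assumption: every symbol inside the span (matched or skipped) contributes
    its own decay factor; symbols missing from `decay` use `default_decay`.
    """
    return prod(decay.get(sym, default_decay) for sym in seq[idx[0]:idx[-1] + 1])


def soft_subsequence_kernel(s, t, n, decay=None, default_decay=0.5, match=None):
    """Brute-force gap-weighted subsequence kernel over token sequences.

    s, t  : sequences of symbols (letters, syllables, words, ...)
    n     : subsequence length
    decay : optional dict mapping a symbol to its own decay factor
    match : optional function (a, b) -> similarity in [0, 1]; the default
            is exact matching (1.0 if identical, else 0.0)
    """
    decay = decay or {}
    if match is None:
        match = lambda a, b: 1.0 if a == b else 0.0

    total = 0.0
    # Enumerate every increasing index tuple of length n in both sequences.
    for i in combinations(range(len(s)), n):
        w_s = span_weight(s, i, decay, default_decay)
        for j in combinations(range(len(t)), n):
            # Soft match score: product of pairwise symbol similarities.
            m = prod(match(s[a], t[b]) for a, b in zip(i, j))
            if m:
                total += m * w_s * span_weight(t, j, decay, default_decay)
    return total


# Syllable-level comparison with exact matching and a single decay factor.
k_plain = soft_subsequence_kernel(["ker", "nel"], ["ker", "nel"], n=2)

# A symbol whose insertion is considered insignificant gets decay 1.0.
k_gap = soft_subsequence_kernel(["a", "-", "b"], ["a", "b"], n=2, decay={"-": 1.0})

# Soft matching: non-identical symbols still contribute, scaled by 0.5.
k_soft = soft_subsequence_kernel(
    ["x"], ["y"], n=1, match=lambda a, b: 1.0 if a == b else 0.5
)
```

With exact matching and one shared decay factor, this reduces to the usual gap-weighted subsequence count. The enumeration is exponential in `n`, so it is only suitable for checking small examples. If the kernel is to be used in an SVM, the `match` function should itself be a valid (positive semi-definite) similarity, which this sketch does not verify.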

Similar resources

Syllables and other String Kernel Extensions

In recent years, the use of string kernels that compare documents has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents compared to a character-based representation and, as a result, reduces computation time. Moreover sylla...

The Spectrum Kernel: A String Kernel for SVM Protein Classification

We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an S...

Position-Aware String Kernels with Weighted Shifts and a General Framework to Apply String Kernels to Other Structured Data

In combination with efficient kernel-based learning machines such as the Support Vector Machine (SVM), string kernels have proven to be highly effective in a wide range of research areas (e.g. bioinformatics, text analysis, voice analysis). Many of the string kernels proposed so far take advantage of simpler kernels, such as trivial comparison of characters and/or substrings, and are classifie...

Learning state machine-based string edit kernels

During the past few years, several approaches have been proposed to derive string kernels from probability distributions. For instance, the Fisher kernel uses a generative model M (e.g. a hidden Markov model) and compares two strings according to how they are generated by M. On the other hand, the marginalized kernels allow the computation of the joint similarity between two instances by summing condit...

String Subsequence Kernels for Text Classification

This paper explores the string subsequence kernel, a kernel function whose feature space is generated by subsequences of strings. This kernel compares two strings based on the number of occurrences of common substrings they contain, where each common substring is weighted based on how contiguous that substring is within the string. Although a recursive definition of the string subsequence kerne...

Publication date: 2002